Take-home Exercise 3

Time Traveling Through Trade: Visualizing Temporal Patterns to Expose Illegal Fishing

Author

Abhishek Singh

Published

May 28, 2023

Modified

June 18, 2023

Overview

FishEye International is a non-profit organization that counters illegal, unreported, and unregulated (IUU) fishing. It has recently obtained access to a comprehensive database from an international finance corporation that details fishing-related businesses. The database, converted into a knowledge graph, carries valuable information about the companies, their owners, employees, and financial conditions. Traditionally, analysts at FishEye have tried to uncover business anomalies using standard graph analyses and node-link visualizations. However, the scale and intricacy of the data make it hard to discern the true structure of the businesses. A more effective visual analytics approach is therefore needed to help identify anomalous companies potentially involved in IUU fishing. This analysis aims to provide a detailed understanding of the patterns of entities and their activities over time.

Objective

The primary goal of this assignment is to devise a new approach that can efficiently process the large and detailed knowledge graph data to identify anomalies in fishing businesses. This approach should allow us to spot irregular patterns, uncover hidden relationships, and reveal potential IUU-involved companies. By accomplishing this objective, we aim to significantly improve FishEye International’s ability to identify, monitor, and counteract IUU fishing activities.

My Task

Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.

1. Data Preparation

1.1 Install R packages and import dataset

A Glimpse into the Code
pacman::p_load(jsonlite, igraph, tidygraph, ggraph,
               lubridate, tidyverse, graphlayouts,knitr,plotly, 
               ggthemes,hrbrthemes,treemap,patchwork, ggiraph,
               ggstatsplot, summarytools, ggforce, 
               skimr, tidytext,wordcloud)

The code chunk uses pacman::p_load() to check whether each package is installed, install any that are missing, and then load them all into the R session. The main packages used in this exercise are:

  • jsonlite: parses JSON data and converts it to R lists and data frames.

  • igraph: offers a wide range of graph algorithms and visualization capabilities.

  • tidygraph: an interface for manipulating and analyzing graphs using the principles of tidy data.

  • ggraph: creates customizable graph (network) visualizations.

  • lubridate: works with dates and times.

  • ggiraph: adds interactive features such as tooltips, zooming, and panning; particularly useful for interactive web-based visualizations.

  • hrbrthemes: provides additional themes and styling options.

  • treemap: creates treemaps.

  • plotly: creates interactive web-based graphs.

  • ggstatsplot: creates graphics annotated with details from statistical tests.

  • graphlayouts: provides additional layout algorithms for arranging the nodes and edges of a graph.

  • knitr: dynamic report generation.

  • ggdist: visualizes distributions and uncertainty.

  • ggthemes: provides additional themes for ggplot2.

  • tidyverse: a collection of core data science packages, used extensively for data preparation and wrangling.

  • rstatix: data manipulation, summarization, and group-wise comparisons.

  • Hmisc: computes descriptive statistics for the variables in a data set.

  • DT: renders interactive DataTables on an HTML page.

  • summarytools: creates summary statistics and tables for data exploration and reporting.

  • kableExtra: builds styled tables for output formats such as HTML and PDF.

  • ggplot2: provides a flexible, layered approach to creating a wide variety of high-quality static plots.

    All packages are available from CRAN. Hmisc and DT do not appear in the p_load() call; they are accessed later with the :: prefix (Hmisc::describe(), DT::datatable()), and ggplot2 is loaded as part of tidyverse.
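As a rough illustration of what p_load() does for each package, the check, install, and load cycle can be sketched in base R. The helper name load_or_install() is hypothetical and is not part of pacman; the sketch also assumes a CRAN mirror is configured whenever something actually needs installing. It is called here with packages that ship with R, so nothing is downloaded.

```r
# Hypothetical helper (not pacman's own code): install a package only if it
# is missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)  # assumes a CRAN mirror is configured
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# "stats" and "utils" ship with R, so this call installs nothing
loaded <- load_or_install(c("stats", "utils"))
```

In practice pacman::p_load() remains the more convenient choice, since it accepts unquoted package names and handles many packages in one call.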

1.2 Importing data sets

In the code chunk below, fromJSON() of the jsonlite package imports MC3.json into the R environment. The output, mc3, is a large list object whose nodes and links elements are used in the following steps.

A Glimpse into the Code
mc3 <- fromJSON("data/MC3.json")

1.3 Extracting Edges

A Glimpse into the Code
MC3_Edges <- as_tibble(mc3$links) %>% 
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
    summarise(weights = n(), .groups = "drop") %>%
  filter(source!=target) %>%
  ungroup()
A Glimpse into the Code
kable(head(MC3_Edges), format = "html", caption = "EDGES")
EDGES

| source | target | type | weights |
|---|---|---|---|
| 1 AS Marine sanctuary | Christina Taylor | Company Contacts | 1 |
| 1 AS Marine sanctuary | Debbie Sanders | Beneficial Owner | 1 |
| 1 Ltd. Liability Co Cargo | Angela Smith | Beneficial Owner | 1 |
| 1 S.A. de C.V. | Catherine Cox | Company Contacts | 1 |
| 1 and Sagl Forwading | Angela Mendoza | Company Contacts | 1 |
| 1 and Sagl Forwading | Christopher Watson | Beneficial Owner | 1 |
Note
  • distinct() removes duplicated records.
  • mutate() and as.character() convert the fields from list to character type.
  • group_by() and summarise() count the rows for each unique combination of source, target, and type, and store the count in weights. Since distinct() runs first, no duplicates remain to be counted, and every weight in this data set is 1, as the summaries that follow confirm.
  • filter(source != target) removes self-loops, i.e. records whose source and target are identical.
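The interaction between de-duplication and counting can be seen on a toy edge list. This is a minimal base R sketch with made-up rows, not the author's pipeline: unique() stands in for distinct() and aggregate() for group_by() plus summarise().

```r
# Toy edge list: rows 1 and 2 are exact duplicates, row 4 is a self-loop
edges <- data.frame(
  source = c("A", "A", "B", "C"),
  target = c("x", "x", "y", "C"),
  type   = c("Beneficial Owner", "Beneficial Owner",
             "Company Contacts", "Company Contacts"),
  stringsAsFactors = FALSE
)

# Counting on the raw rows: the duplicated link gets weight 2
raw_counts <- aggregate(weights ~ source + target + type,
                        data = cbind(edges, weights = 1L), FUN = sum)

# De-duplicating first: every remaining link is counted exactly once
dedup_counts <- aggregate(weights ~ source + target + type,
                          data = cbind(unique(edges), weights = 1L), FUN = sum)

# Drop self-loops, mirroring filter(source != target)
dedup_counts <- dedup_counts[dedup_counts$source != dedup_counts$target, ]
```

If link multiplicity is meant to be kept as a weight, the counting step would have to run before the de-duplication step.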
A Glimpse into the Code
skim(MC3_Edges)
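Data summary

| Property | Value |
|---|---|
| Name | MC3_Edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| Column type frequency: character | 3 |
| Column type frequency: numeric | 1 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |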
A Glimpse into the Code
str(MC3_Edges)
tibble [24,036 × 4] (S3: tbl_df/tbl/data.frame)
 $ source : chr [1:24036] "1 AS Marine sanctuary" "1 AS Marine sanctuary" "1 Ltd. Liability Co Cargo" "1 S.A. de C.V." ...
 $ target : chr [1:24036] "Christina Taylor" "Debbie Sanders" "Angela Smith" "Catherine Cox" ...
 $ type   : chr [1:24036] "Company Contacts" "Beneficial Owner" "Beneficial Owner" "Company Contacts" ...
 $ weights: int [1:24036] 1 1 1 1 1 1 1 1 1 1 ...
A Glimpse into the Code
DT::datatable(MC3_Edges, class= "compact", filter='top')
A Glimpse into the Code
Hmisc::describe(MC3_Edges)
MC3_Edges 

 4  Variables      24036  Observations
--------------------------------------------------------------------------------
source 
       n  missing distinct 
   24036        0    12856 

lowest : 1 and Sagl Forwading        1 AS Marine sanctuary       1 Ltd. Liability Co Cargo   1 S.A. de C.V.              2 Limited Liability Company
highest: zūn yú GmbH & Co. KG Creek  zūn yú N.V. Shipping        zūn yú S.A. de C.V.         Zuniga-Young                Zuniga and Sons            
--------------------------------------------------------------------------------
target 
       n  missing distinct 
   24036        0    21265 

lowest : Aaron Adams   Aaron Adkins  Aaron Allen   Aaron Alvarez Aaron Baker  
highest: Zachary York  Zachary Young Zoe Allen     Zoe Marsh     Zoe Smith    
--------------------------------------------------------------------------------
type 
       n  missing distinct 
   24036        0        2 
                                            
Value      Beneficial Owner Company Contacts
Frequency             16792             7244
Proportion            0.699            0.301
--------------------------------------------------------------------------------
weights 
       n  missing distinct     Info     Mean      Gmd 
   24036        0        1        0        1        0 
                
Value          1
Frequency  24036
Proportion     1
--------------------------------------------------------------------------------

Checking Missing Values:

A Glimpse into the Code
colSums(is.na(MC3_Edges))
 source  target    type weights 
      0       0       0       0 

Checking Duplicates

A Glimpse into the Code
any(duplicated(MC3_Edges))
[1] FALSE
  • The source data set is an undirected multi-graph with 27,622 nodes and 24,038 edges; after the cleaning steps above, MC3_Edges holds 24,036 rows.
  • It contains 7,794 connected components.
  • The graph is undirected, meaning that relationships have no specific direction or order: a connection between two nodes applies both ways.
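To make the notion of a connected component concrete, here is a small base R sketch that counts components in an undirected edge list with a simple union-find. The function name count_components() and the toy data are illustrative assumptions; on the real data the same figure would normally come from igraph::components().

```r
# Count connected components of an undirected graph (illustrative union-find)
count_components <- function(nodes, edges) {
  # parent[i] points towards the representative of node i's group
  parent <- seq_along(nodes)

  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Merge the groups of the two endpoints of every edge
  for (k in seq_len(nrow(edges))) {
    a <- find_root(match(edges$source[k], nodes))
    b <- find_root(match(edges$target[k], nodes))
    if (a != b) parent[a] <- b
  }

  roots <- vapply(seq_along(nodes), find_root, integer(1))
  length(unique(roots))
}

nodes <- c("Co1", "Co2", "Co3", "P1", "P2", "P3")
edges <- data.frame(source = c("Co1", "Co2"),
                    target = c("P1", "P1"),
                    stringsAsFactors = FALSE)

# Co1, Co2 and P1 form one component; Co3, P2 and P3 are isolated
n_components <- count_components(nodes, edges)
```

Because the graph is undirected, swapping source and target in any row leaves the result unchanged.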

Edge Attributes:

  • type: This attribute represents the type or nature of the relationship or interaction between the nodes connected by the edge.
  • source: This is the ID of the source node. It identifies where the relationship or interaction originates from in the network.
  • target: This is the ID of the target node. It identifies where the relationship or interaction is directed towards in the network.
  • role: This provides a more specific classification of the relationship or interaction represented by the edge, like beneficial owner or company contacts.
A Glimpse into the Code
MC3_Edges_count <- MC3_Edges %>%
  group_by(type) %>%
  summarise(n = n())


p <- ggplot(data = MC3_Edges_count, aes(x = type, y = n, fill = type)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = n), vjust = -0.5) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(plot.background = element_rect(fill = "seashell"),
        panel.grid.major = element_line(color = "grey80"),
        panel.grid.minor = element_blank(),
        legend.position = "top",
        text = element_text(size = 12, face = "bold"),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Type", y = "Count", fill = "Type",
       title = "Distribution of Edge Types")

ggplotly(p)

1.4 Extracting Nodes

A Glimpse into the Code
MC3_Nodes <- as_tibble(mc3$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)
A Glimpse into the Code
kable(head(MC3_Nodes), format = "html", caption = "NODES")
NODES

| id | country | type | revenue_omu | product_services |
|---|---|---|---|---|
| Jones LLC | ZH | Company | 310612303 | Automobiles |
| Coleman, Hall and Lopez | ZH | Company | 162734684 | Passenger cars, trucks, vans, and buses |
| Aqua Advancements Sashimi SE Express | Oceanus | Company | 115004667 | Holding firm whose subsidiaries are engaged in the businesses of refining and chemicals, process and pollution control equipment, minerals, fertilizers, polymers and fibers, commodity trading and services, forest and consumer products, and ranching |
| Makumba Ltd. Liability Co | Utoporiana | Company | 90986413 | Car service, car parts and accessories, automotive technology, diagnostics for repair shops, antilock braking and fuel-injection systems, auto electronics, starters, and alternators; Home (power tools for DIY enthusiasts, garden tools, household appliances, heating and warm water); and industry and trade (communication services, power tools for professional, sensors and foundry - MEMS, security systems, packaging technology) |
| Taylor, Taylor and Farrell | ZH | Company | 81466667 | Fully electric vehicles (EVs) and electric vehicle powertrain components |
| Harmon, Edwards and Bates | ZH | Company | 75070435 | Discount supermarket; Variety of food and non-food products |
Note
  • mutate() and as.character() convert the fields from list to character type.
  • revenue_omu is converted from list to numeric in two steps: as.character() first turns each value into text, then as.numeric() parses that text into a number.
  • select() re-orders the fields.
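One side effect of calling as.character() on a list column is worth knowing: an empty list element is turned into the literal text "character(0)" rather than a missing value. The base R sketch below uses made-up values to show this, plus one possible alternative that yields NA instead; whether the MC3 data contains such empty entries is an assumption here, although it would be consistent with the tokens that surface in the text-mining step later on.

```r
# A toy list column: the second entry is empty
product_services <- list("Fish and seafood", character(0), "Unknown")

# as.character() deparses the empty element into the string "character(0)"
flattened <- as.character(product_services)
print(flattened)

# One alternative: map empty elements to NA and collapse longer ones
safe <- vapply(
  product_services,
  function(x) if (length(x) == 0) NA_character_ else paste(x, collapse = "; "),
  character(1)
)
print(safe)
```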
A Glimpse into the Code
skim(MC3_Nodes)
Data summary

| Property | Value |
|---|---|
| Name | MC3_Nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| Column type frequency: character | 4 |
| Column type frequency: numeric | 1 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
A Glimpse into the Code
str(MC3_Nodes)
tibble [27,622 × 5] (S3: tbl_df/tbl/data.frame)
 $ id              : chr [1:27622] "Jones LLC" "Coleman, Hall and Lopez" "Aqua Advancements Sashimi SE Express" "Makumba Ltd. Liability Co" ...
 $ country         : chr [1:27622] "ZH" "ZH" "Oceanus" "Utoporiana" ...
 $ type            : chr [1:27622] "Company" "Company" "Company" "Company" ...
 $ revenue_omu     : num [1:27622] 3.11e+08 1.63e+08 1.15e+08 9.10e+07 8.15e+07 ...
 $ product_services: chr [1:27622] "Automobiles" "Passenger cars, trucks, vans, and buses" "Holding firm whose subsidiaries are engaged in the businesses of refining and chemicals, process and pollution "| __truncated__ "Car service, car parts and accessories, automotive technology, diagnostics for repair shops, antilock braking a"| __truncated__ ...
A Glimpse into the Code
DT::datatable(MC3_Nodes, class= "compact", filter='top')
A Glimpse into the Code
Hmisc::describe(MC3_Nodes)
MC3_Nodes 

 5  Variables      27622  Observations
--------------------------------------------------------------------------------
id 
       n  missing distinct 
   27622        0    22929 

lowest : 1 and Sagl Forwading          1 AS Marine sanctuary         1 Eel Corporation Transport   1 Ltd. Corporation Transport  1 Ltd. Liability Co          
highest: Zuniga Inc                    Zuniga Ltd                    Zuniga PLC                    Zuniga, Burgess and Davenport Zuniga, Logan and Newton     
--------------------------------------------------------------------------------
country 
       n  missing distinct 
   27622        0      100 

lowest : Afarivaria      Alverossia      Alverovia       Andenovia       Anderia del Mar
highest: Wysterion       Yggdrasonia     Zambarka        Zawalinda       ZH             
--------------------------------------------------------------------------------
type 
       n  missing distinct 
   27622        0        3 
                                                             
Value      Beneficial Owner          Company Company Contacts
Frequency             11949             8639             7034
Proportion            0.433            0.313            0.255
--------------------------------------------------------------------------------
revenue_omu 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
    6107    21515     4637        1  1822155  3574819     4915     5243 
     .25      .50      .75      .90      .95 
    7676    16211    48328   190919   612716 

lowest : 3.652227e+03 4.657797e+03 4.660665e+03 4.666673e+03 4.666703e+03
highest: 2.914265e+08 2.929701e+08 3.049959e+08 3.082496e+08 3.106123e+08
--------------------------------------------------------------------------------
product_services 
       n  missing distinct 
   27622        0     3244 

lowest : (Italian) peeled tomatoes, legumes, vegetables, fruits and canned mushrooms                                                                                                                                                         100 percent Spanish olives; peppers, green, black, and manzanilla stuffed olives; anchovies-stuffed olives; and black olives; Olives recipes                                                                                        2 or 3-piece containers, twist off caps, easy opening and traditional caps; Cutting; varnishing and metal plate lithography                                                                                                         8 Cement Mixer Units, Ocean Freight, Air Freight, Project Logistics, Continental Container Line CCL, Atlantic Pacific Container line APL, Project Arabia Line PAL source: freelance researcher                                      A chemical science firm with a focus on the development of high purity, high performance products and services                                                                                                                     
highest: Young soybeans in pods; and spring rolls includes shrimp mini spring rolls with shiitake mushroom, vegetable mini spring rolls with shiitake mushroom, and all natural pre-fried vegetable mini spring rolls with shiitake mushroom Zinc and aluminum die cast hardware and components                                                                                                                                                                                  Zinc and aluminum die cast parts                                                                                                                                                                                                    Zinc metal                                                                                                                                                                                                                          Zumba clothing and accessories                                                                                                                                                                                                     
--------------------------------------------------------------------------------

Checking Missing Values:

A Glimpse into the Code
colSums(is.na(MC3_Nodes))
              id          country             type      revenue_omu 
               0                0                0            21515 
product_services 
               0 

Checking Duplicates

A Glimpse into the Code
any(duplicated(MC3_Nodes))
[1] TRUE

Node Attributes:

  • type: The classification or category of the node. This can indicate the nature of the entity, such as company, owner, or worker.
  • country: This attribute represents the country associated with the node. This can be either a full country name or a two-letter country code.
  • product_services: This provides a description of the products or services associated with the node. This can help in understanding the node’s role in the network.
  • revenue_omu: This is the operating revenue of the node in Oceanus Monetary Units (OMU). It gives a measure of the financial size or activity of the node.
  • id: This is the unique identifier of the node. This ID is also the name of the entity it represents.
  • role: This is a subset of the “type” attribute, providing more detailed classification of the node. It includes roles like beneficial owner or company contacts.

Data Visualization

A Glimpse into the Code
MC3_Nodes_count <- MC3_Nodes %>%
  group_by(type) %>%
  summarise(n = n())


p <- ggplot(data = MC3_Nodes_count, aes(x = type, y = n, fill = type)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = n), vjust = -0.5) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(plot.background = element_rect(fill = "seashell"),
        panel.grid.major = element_line(color = "grey80"),
        panel.grid.minor = element_blank(),
        legend.position = "top",
        text = element_text(size = 12, face = "bold"),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Type", y = "Count", fill = "Type",
       title = "Distribution of Node Types")

ggplotly(p)

2. Visualization

2.1 Top 10 Countries with Highest Revenue

A Glimpse into the Code
# Group the data by country and calculate the total revenue
top_countries <- MC3_Nodes %>%
  group_by(country) %>%
  summarise(total_revenue = sum(revenue_omu, na.rm = TRUE)) %>%
  arrange(desc(total_revenue)) %>%
  head(10)

# Plot the top 10 countries by total revenue
p <- ggplot(data = top_countries, aes(x = reorder(country, -total_revenue), y = total_revenue)) +
  geom_bar(stat = "identity") +
  #geom_text(aes(label = round(total_revenue)), vjust = -0.5) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(plot.background = element_rect(fill = "seashell"),
        panel.grid.major = element_line(color = "grey80"),
        panel.grid.minor = element_blank(),
        legend.position = "top",
        text = element_text(size = 12, face = "bold"),
        plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = "Country", y = "Total Revenue (OMU)", fill = "Country",
       title = "Top 10 Countries by Total Revenue")

# Convert ggplot object to a plotly object for interactivity
p_interactive <- ggplotly(p)

p_interactive

2.2 Number of Edges (connections) per Node

A Glimpse into the Code
# Build an undirected graph from the edge list with igraph's graph_from_data_frame()
g <- graph_from_data_frame(MC3_Edges, directed = FALSE)

# Calculate the degree of each node
node_degrees <- degree(g)

# Convert to a data frame for plotting
df_degrees <- data.frame(node = names(node_degrees), degree = node_degrees)

# Histogram
p <- ggplot(df_degrees, aes(x = degree)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  xlim(0, 6) +
  theme_minimal() +
  labs(x = "Degree", y = "Count", title = "Distribution of Node Degrees")

gp <- ggplotly(p)

gp
Note
  • Most nodes in the graph have a degree of 1: 29,229 nodes have exactly one connection. The graph is built from the edge list, so its nodes are the distinct source and target values.

  • The count falls sharply as the degree rises. Only 2,526 nodes have a degree of 2, far fewer than those with a degree of 1.

  • The decline continues for degrees 3, 4, and 5, with 1,100, 447, and 257 nodes respectively. Entities with many connections are therefore rare in this network.

  • Among the degrees shown, 5 is the rarest. The x-axis is capped by xlim(0, 6), so nodes with higher degrees, which are the potential hubs, are not displayed. Overall, the degree distribution suggests a sparse and fragmented network, which may make broad structural anomalies harder to identify. It also marks the entities with higher degrees as points of interest.
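The degree figures discussed above can be reproduced in principle without a graph library: in an undirected edge list, a node's degree is simply the number of times it appears as an endpoint. The base R sketch below uses a made-up edge list, not the MC3 data, and is meant only to show the counting logic behind degree().

```r
# Toy undirected edge list (companies linked to people)
edges <- data.frame(
  source = c("Co1", "Co1", "Co2", "Co3"),
  target = c("P1", "P2", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Degree = number of appearances as either endpoint
endpoint_table <- table(c(edges$source, edges$target))
node_degree <- setNames(as.integer(endpoint_table), names(endpoint_table))

# Degree distribution: how many nodes have each degree
degree_table <- table(node_degree)
degree_counts <- setNames(as.integer(degree_table), names(degree_table))
print(degree_counts)
```

Here four nodes have degree 1 and two nodes have degree 2, the same skewed shape, on a tiny scale, as the histogram above.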

2.3 Proportion of Nodes in each ‘Country’

A Glimpse into the Code
# Calculating the number of nodes in each country
country_nodes <- MC3_Nodes %>%
  count(country) %>%
  arrange(desc(n)) %>%
  head(10)


p1 <- ggplot(country_nodes, aes(reorder(country, -n), n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 10 Countries by Node Count", x = "Country", y = "Node Count") +
  coord_flip() +
  theme_minimal() +
  theme(plot.background = element_rect(fill = "seashell"),
        panel.grid.major = element_line(color = "grey80"),
        panel.grid.minor = element_blank(),
        legend.position = "top",
        text = element_text(size = 12, face = "bold"),
        plot.title = element_text(hjust = 0.5)) 

gp1 <- ggplotly(p1)

gp1
Note
  • The country with the most nodes is ZH, with 22,439 nodes. This heavy concentration suggests that ZH is a major player within the network.

  • The second most represented country is Oceanus, with 2,143 nodes. This is far fewer than ZH, but still a substantial presence.

  • The third most represented country is Marebak, with 742 nodes, roughly a third of Oceanus's count and a small fraction of ZH's.

  • Overall, nodes are concentrated in a few countries, specifically ZH, Oceanus, and Marebak, which could indicate a centralization of activities in these regions. Further investigation could clarify what roles these countries play and how their large presence affects the dynamics of the whole network.
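A short calculation puts the node counts quoted above into proportion. The three counts and the total of 27,622 rows in MC3_Nodes come from this document; the variable names are only illustrative.

```r
# Node counts quoted in the note, and the total number of rows in MC3_Nodes
node_counts <- c(ZH = 22439, Oceanus = 2143, Marebak = 742)
total_nodes <- 27622

# Share of all nodes held by each country
share_of_all <- round(node_counts / total_nodes, 3)
print(share_of_all)

# Marebak relative to Oceanus
marebak_vs_oceanus <- node_counts[["Marebak"]] / node_counts[["Oceanus"]]
print(round(marebak_vs_oceanus, 2))
```

ZH alone accounts for about 81% of the nodes, while Marebak's count is about 0.35 of Oceanus's.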

2.4 Centrality

A Glimpse into the Code
id1 <- MC3_Edges %>%
  select(source) %>%
  rename(id = source)
id2 <- MC3_Edges %>%
  select(target) %>%
  rename(id = target)
MC3_Nodes1 <- rbind(id1, id2) %>%
  distinct() %>%
  # Join on the shared id column explicitly
  left_join(MC3_Nodes,
            by = "id",
            unmatched = "drop")
A Glimpse into the Code
MC3_Graph <- tbl_graph(nodes = MC3_Nodes1,
                       edges = MC3_Edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
A Glimpse into the Code
MC3_Graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  # Constant alpha and colour are set outside aes() so they are not treated as mappings
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph(background = "seashell")

2.5 Top 5 Countries by Revenue

A Glimpse into the Code
top_5 <- MC3_Nodes %>%
  group_by(country) %>%
  summarise(total_revenue = sum(revenue_omu, na.rm = TRUE)) %>%
  arrange(desc(total_revenue)) %>%
  head(5)
# Filtering
top_countries_5 <- MC3_Nodes[MC3_Nodes$country %in% top_5$country, ]

# Grouping by country and company, and calculating total revenue per company
top_countries_5 <- top_countries_5 %>%
  filter(type=="Company") %>%
  group_by(country, id) %>%
  summarise(company_revenue = sum(revenue_omu, na.rm = TRUE), .groups = "drop") %>%
  arrange(country, desc(company_revenue))

# For each country, keep the five companies with the highest revenue
top_countries_5 <- top_countries_5 %>%
  group_by(country) %>%
  slice_max(order_by = company_revenue, n = 5)


treemap(top_countries_5,
        index = c("country", "id"),
        vSize = "company_revenue",
        vColor = "company_revenue",
        palette = "Paired",
        border.lwds = 2,
        border.col = "white",
        title = "Top Companies by Revenue in Top 5 Countries",
        fontsize.labels = c(14, 10),
        fontfamily.labels = "Arial", 
        fontcolor.labels = c("white", "black"),
        align.labels = list(
                      c("center", "center"),
                      c("left", "top")
        ), 
        position.legend = "bottom"
    
)

Note

The treemap shows the companies that generate the highest revenues in their respective countries. The top 5 countries were selected by total revenue; within each of them, the five companies with the highest revenue were kept.

The findings from the treemap are as follows:

  • Most of the highest-revenue companies are registered in the country labelled ‘ZH’.
  • Among these, the top 3 companies by revenue are ‘Jones LLC’, ‘Patton Ltd’, and ‘Ramirez,Gallaghar and Jhonson’ Group.
  • In ‘Utoporiana’ and ‘Oceanus’, ‘Assam Limited Liability Company’ and ‘Aqua Advancements Sashimi SE Express’ are the top revenue earners respectively.
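The two-stage selection behind the treemap (top countries by total revenue first, then top companies within each) can be sketched in base R on made-up data. The figures and names below are invented for illustration, and unlike the code above, which sums with na.rm = TRUE, this sketch simply drops rows with missing revenue.

```r
# Toy node table: one row per company
nodes <- data.frame(
  country     = c("ZH", "ZH", "ZH", "Oceanus", "Oceanus", "Marebak"),
  id          = c("A", "B", "C", "D", "E", "F"),
  revenue_omu = c(300, 160, NA, 115, 20, 5),
  stringsAsFactors = FALSE
)

# Stage 1: total revenue per country (the formula interface omits NA rows)
totals <- aggregate(revenue_omu ~ country, data = nodes, FUN = sum)
top_countries <- head(totals$country[order(-totals$revenue_omu)], 2)

# Stage 2: within the selected countries, keep the top company of each
in_top <- nodes[nodes$country %in% top_countries & !is.na(nodes$revenue_omu), ]
ranked <- in_top[order(in_top$country, -in_top$revenue_omu), ]
top_companies <- do.call(rbind, lapply(split(ranked, ranked$country), head, n = 1))
print(top_companies)
```

Changing n in the last step from 1 to 5 would mirror the slice_max(n = 5) call used for the treemap.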

2.6 Tokenization

Calculating the number of times the word “fish” appears in the product_services field.

A Glimpse into the Code
MC3_Nodes %>% 
    mutate(n_fish = str_count(product_services, "fish")) 
# A tibble: 27,622 × 6
   id                          country type  revenue_omu product_services n_fish
   <chr>                       <chr>   <chr>       <dbl> <chr>             <int>
 1 Jones LLC                   ZH      Comp…  310612303. Automobiles           0
 2 Coleman, Hall and Lopez     ZH      Comp…  162734684. Passenger cars,…      0
 3 Aqua Advancements Sashimi … Oceanus Comp…  115004667. Holding firm wh…      0
 4 Makumba Ltd. Liability Co   Utopor… Comp…   90986413. Car service, ca…      0
 5 Taylor, Taylor and Farrell  ZH      Comp…   81466667. Fully electric …      0
 6 Harmon, Edwards and Bates   ZH      Comp…   75070435. Discount superm…      0
 7 Punjab s Marine conservati… Riodel… Comp…   72167572. Beef, pork, chi…      0
 8 Assam   Limited Liability … Utopor… Comp…   72162317. Power and Gas s…      0
 9 Ianira Starfish Sagl Import Rio Is… Comp…   68832979. Light commercia…      0
10 Moran, Lewis and Jimenez    ZH      Comp…   65592906. Automobiles, tr…      0
# ℹ 27,612 more rows
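Note that str_count() counts pattern matches, not whole words, and is case-sensitive. The base R sketch below mimics that behaviour on two made-up descriptions; the helper name count_matches() is hypothetical, and it treats the pattern as a fixed string rather than a regular expression.

```r
# Count non-overlapping occurrences of a fixed string in each element of text
count_matches <- function(text, pattern) {
  hits <- gregexpr(pattern, text, fixed = TRUE)
  lengths(regmatches(text, hits))
}

descriptions <- c("Fish and shellfish; fish oil", "Automobiles")

# "Fish" (capital F) is not matched, but the "fish" inside "shellfish" is
n_fish <- count_matches(descriptions, "fish")
print(n_fish)
```

Lower-casing the text first, or matching on word boundaries, would change the counts, so the choice depends on whether compound words such as “shellfish” should be included.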

Tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for the processes like parsing and text mining.

In the code chunk below, unnest_tokens() of tidytext is used to split the text in the product_services field into words.

A Glimpse into the Code
token_nodes <- MC3_Nodes %>%
  unnest_tokens(word, 
                product_services)
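For intuition, word tokenisation can be approximated in base R by lower-casing the text and splitting on anything that is not a letter or digit. This is only a rough stand-in for unnest_tokens(), which also lower-cases by default but uses a more careful word-boundary algorithm; the function name tokenise() is hypothetical.

```r
# Rough word tokeniser: lower-case, split on non-alphanumeric runs, drop empties
tokenise <- function(text) {
  pieces <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  pieces[nzchar(pieces)]
}

words <- tokenise("Fish, seafood & frozen products")
print(words)

# A placeholder string such as "character(0)" splits into two tokens
placeholder_tokens <- tokenise("character(0)")
print(placeholder_tokens)
```

The second call shows how a placeholder string left over from type conversion can end up as ordinary-looking tokens in a word count.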

Top 15 words:

A Glimpse into the Code
p <- token_nodes %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # labs() names aesthetics, not screen axes: x is still the word after coord_flip()
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field") +
  theme(plot.background = element_rect(fill = "seashell"))

ggplotly(p)

The bar chart shows that the most frequent words include some that carry little meaning, for instance “a” and “to”. In the world of text mining these are called stop words. They are fillers used to compose sentences, so they should be removed from the analysis.

The tidytext package includes a data set called stop_words that helps us remove them.

A Glimpse into the Code
stopwords_removed <- token_nodes %>% 
  # Join on the shared word column explicitly
  anti_join(stop_words, by = "word")
Tip

There are two processes:

  • Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis.
  • Then anti_join() of dplyr package is used to remove all stop words from the analysis.
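The effect of the anti-join can be illustrated in base R with a tiny, made-up stop-word list. The list below is an assumption for the example and is far shorter than the stop_words data shipped with tidytext.

```r
# Toy token table, one word per row
tokens <- data.frame(
  word = c("a", "fish", "to", "seafood", "and", "fish"),
  stringsAsFactors = FALSE
)

# Hypothetical mini stop-word list
stop_list <- c("a", "and", "of", "the", "to")

# Keep only the rows whose word is NOT in the stop list (what anti_join does)
kept <- tokens[!(tokens$word %in% stop_list), , drop = FALSE]

# Count the remaining words, most frequent first
word_counts <- sort(table(kept$word), decreasing = TRUE)
print(word_counts)
```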

Checking the top 50 words and their counts

A Glimpse into the Code
top_50 <- stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(50)

print(top_50)
# A tibble: 50 × 2
   word          n
   <chr>     <int>
 1 0         18959
 2 character 18959
 3 unknown    4645
 4 products   1860
 5 fish        740
 6 seafood     622
 7 frozen      467
 8 services    429
 9 food        345
10 related     329
# ℹ 40 more rows

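Removing unwanted tokens such as “0”, “character”, and “unknown” from stopwords_removed. The tokens “character” and “0” occur equally often (18,959 times each), which suggests they come from empty entries that as.character() rendered as the text “character(0)” during data preparation.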

A Glimpse into the Code
filtered_words <- stopwords_removed %>%
  filter(!(word %in% c("0", "character", "unknown")))

Checking the top 50 words in filtered_words

A Glimpse into the Code
top_50 <- filtered_words %>%
  count(word, sort = TRUE) %>%
  top_n(50)

print(top_50)
# A tibble: 51 × 2
   word          n
   <chr>     <int>
 1 products   1860
 2 fish        740
 3 seafood     622
 4 frozen      467
 5 services    429
 6 food        345
 7 related     329
 8 equipment   309
 9 fresh       276
10 salmon      252
# ℹ 41 more rows
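
The tibble above has 51 rows although 50 were requested. This is most likely because top_n() keeps every row tied at the cut-off. The sketch below uses hypothetical counts to show that behaviour, and how slice_max() with with_ties = FALSE returns an exact number of rows if that is preferred.

```r
library(dplyr)

# Hypothetical counts with a tie at the cut-off
toy_counts <- tibble(
  word = c("fish", "seafood", "frozen", "fresh"),
  n    = c(10L, 7L, 5L, 5L)
)

# top_n() keeps all tied rows, so asking for 3 returns 4 rows here
with_ties <- toy_counts %>% top_n(3, n)

# slice_max() with with_ties = FALSE returns exactly 3 rows
exact <- toy_counts %>% slice_max(n, n = 3, with_ties = FALSE)

print(nrow(with_ties))
print(nrow(exact))
```
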
A Glimpse into the Code
set.seed(1234)
filtered_words %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))

A Glimpse into the Code
p <- filtered_words %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field") +
  theme(plot.background = element_rect(fill = "seashell"))

ggplotly(p)

Betweenness Centrality on top 15 Words

The nodes dataset has a ‘product_services’ column that the edges dataset lacks. For this analysis, we take the highest-count words from the previous step (the top_words vector in the code chunk lists 14 of them) and identify the nodes whose ‘product_services’ text mentions any of them. Because str_detect() matches substrings, a word such as “fish” also matches “fishing”.

After identifying and filtering these nodes, we use them as a reference for filtering the edges dataset. Specifically, we keep only the edges whose ‘source’ or ‘target’ matches the ID of a filtered node. This gives a network subset related to the chosen words from the ‘product_services’ column.

A Glimpse into the Code
top_words <- c("products", "fish", "seafood", "frozen", "services", 
               "food", "related", "equipment", "fresh", "salmon", 
               "accessories", "materials", "systems", "freight") 

# Filtering nodes that contain the top words in the product_services column
MC3_NodesFilter <- MC3_Nodes %>% 
  filter(str_detect(product_services, paste(top_words, collapse = "|")))

# Filtering edges where the source or target is in the filtered nodes
MC3_EdgeFilter <- MC3_Edges %>% 
  filter(source %in% MC3_NodesFilter$id | target %in% MC3_NodesFilter$id)
A Glimpse into the Code
id1 <- MC3_EdgeFilter %>%
  select(source) %>%
  rename(id = source)
id2 <- MC3_EdgeFilter %>%
  select(target) %>%
  rename(id = target)
MC3_Nodes1 <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_Nodes,
            by = "id",           # name the join key explicitly
            unmatched = "drop")


MC3_G <- tbl_graph(nodes = MC3_Nodes1,
                   edges = MC3_EdgeFilter,
                   directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

Betweenness centrality

Betweenness centrality measures the number of times a node acts as a bridge along the shortest path between two other nodes. It is useful for identifying nodes that serve as a connector or broker within a network. In illegal fishing, a node with high betweenness centrality might represent a key intermediary, such as a specific ship or company that’s heavily involved in transporting or selling illegal catch.
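
As a small worked illustration of the bridging idea, the sketch below builds a hypothetical seven-node network (two triangles joined through a single node "D") and computes betweenness with igraph, the library that tidygraph's centrality_betweenness() wraps. The node names and edges are invented for the example.

```r
library(igraph)

# Hypothetical toy network: triangle A-B-C and triangle E-F-G, joined via D
toy_edges <- data.frame(
  from = c("A", "A", "B", "C", "D", "E", "E", "F"),
  to   = c("B", "C", "C", "D", "E", "F", "G", "G")
)

g <- graph_from_data_frame(toy_edges, directed = FALSE)

# Number of shortest paths between other node pairs that pass through each node
bc <- betweenness(g, directed = FALSE)

print(sort(bc, decreasing = TRUE))
```

Every shortest path between the two triangles has to pass through "D", so it receives the highest score (9, one for each of the 3 x 3 cross-triangle pairs). "C" and "E" follow with 8 each, and the remaining nodes score 0.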

A Glimpse into the Code
# Node degrees, kept for reference (not used in the plot below)
degrees <- degree(MC3_G, mode = "all")

# Keep only nodes with high betweenness centrality
MC3_G_filtered <- MC3_G %>%
  activate(nodes) %>%
  filter(betweenness_centrality >= 10000)

MC3_G_filtered %>%
  activate(nodes) %>%
  mutate(community = as.factor(membership(cluster_louvain(.)))) %>%
  ggraph(layout = "fr") +
  geom_edge_link(edge_alpha = 0.5) +  # constant transparency, set outside aes()
  geom_node_point(aes(size = betweenness_centrality,
                      color = community),  # colour nodes by Louvain community
                  alpha = 0.5, show.legend = TRUE) +
  scale_size_continuous(range = c(1, 10)) +
  labs(title = "Betweenness centrality") +
  theme(plot.background = element_rect(fill = "seashell"))

2.6 Violin Plot: Revenue by Type

The violin plot visualizes the distribution of a numerical variable (revenue_omu) across different categories (type). It provides information on the central tendency, variability, and distributional shape of the revenue data for each type.

A Glimpse into the Code
p <- ggplot(MC3_Nodes1, aes(x = type, y = revenue_omu)) +
  geom_violin(trim = FALSE) +
  labs(x = "Type", y = "Revenue OMU") +
  theme(plot.background = element_rect(fill = "seashell")) +
  scale_y_continuous(labels = scales::comma) +
  coord_flip()


plotly::ggplotly(p)

Additionally, a separate violin plot was created for the ‘Beneficial Owner’ type because it has more revenue than the other types. This allows a more detailed examination of its revenue distribution.

A Glimpse into the Code
# Filter data
MC3_Nodes1_filtered <- MC3_Nodes1 %>%
  filter(type %in% c("Beneficial Owner"))

# Create the violin plot
p <- ggplot(MC3_Nodes1_filtered, aes(x = type, y = revenue_omu)) +
  geom_violin(trim = FALSE) +
  labs(x = "Type", y = "Revenue OMU") +
  theme(plot.background = element_rect(fill = "seashell")) +
  scale_y_continuous(labels = scales::comma) +
  coord_flip()

# Convert to interactive plot
plotly::ggplotly(p)

Recommendations, Limitations and Takeaways

RECOMMENDATIONS

  • Deep Dive into Entities with High Degrees: Given the sparsity of the network, entities with higher degrees can be seen as significant connectors. A deeper dive into these entities could provide more valuable insights. What type of entities are they? How do they connect different parts of the network? What role do they play in the context of fishing business and potential IUU activities?

  • Country-Specific Analysis: Nodes are concentrated in a few countries (especially ZH), so a more detailed country-specific analysis is warranted. Understanding the roles these countries play and how their large presence shapes the dynamics of the whole network could reveal patterns that the aggregate view hides.

  • Revenue-Based Analysis: The treemap visualization and violin plots provided valuable insights into the revenue patterns across different companies and types of entities. A more detailed revenue-based analysis could be performed, exploring the relationship between revenue and other attributes or network properties.

LIMITATIONS

  • Network Sparsity: The network appears to be quite sparse, potentially indicating a disconnected network structure. This might present challenges in identifying broad structural anomalies or overarching patterns.

  • Limited Attributes for Analysis: The lack of attributes limits the depth and breadth of the analysis. For example, attributes related to the nature and volume of fishing activities, legal status, historical data, etc., could have provided additional dimensions for analysis.

KEY TAKEAWAYS

  • Significance of Network Measures: Network measures such as degree and betweenness centrality can provide valuable insights into the roles and importance of nodes within a network. High-degree nodes and nodes with high betweenness centrality can be of particular interest in the context of IUU fishing activities.

  • Role of Textual Data: The analysis also highlighted the potential of textual data. The use of specific words in the ‘product_services’ column allowed for a more targeted analysis and extraction of a relevant subset of the network.

  • Importance of Revenue Analysis: The analysis of revenue data revealed patterns and anomalies that can be indicative of potential IUU activities. Companies generating disproportionately high revenues and the revenue patterns of specific types of entities are worth further investigation.